
SRR769545_mem_rmdup.bam 2> SRR769545_mem_rmdup.log

The above command removes duplicate reads from a paired-end BAM file. To treat
paired-end reads as single-end reads, use the “-S” option.

2.4.1.7  Descriptive Statistics

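Some Samtools utilities, including “flagstat”, “coverage”, and “depth”, provide simple
statistics on a BAM file: “flagstat” counts reads by FLAG category (for example, mapped
and properly paired), “coverage” summarizes coverage for each reference sequence, and
“depth” reports the read depth at each position.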

samtools flagstat SRR769545_mem_sorted.bam

samtools coverage SRR769545_mem_sorted.bam > coverage.txt

samtools depth SRR769545_mem_sorted.bam > depth.txt
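The per-position output of “samtools depth” can be condensed further with standard text tools. The following is a minimal sketch, assuming the usual three tab-separated columns (reference name, position, depth). The file contents and the variable names `depth_file` and `summary` are made up for illustration; in practice you would point awk at the depth.txt file produced above.

```shell
# Create a tiny stand-in for depth.txt; the values are invented for this example.
depth_file=$(mktemp)
printf 'chr1\t1\t10\nchr1\t2\t20\nchr1\t3\t0\nchr1\t4\t30\n' > "$depth_file"

# Report the number of positions, the mean depth, and the fraction with depth > 0.
summary=$(awk -F '\t' '
    { total += $3; n++; if ($3 > 0) covered++ }
    END {
        if (n == 0) { print "no positions reported"; exit }
        printf "positions=%d mean_depth=%.2f covered_fraction=%.2f", n, total / n, covered / n
    }' "$depth_file")

echo "$summary"
rm -f "$depth_file"
```

Note that the mean is taken only over the positions present in the file, so it depends on whether zero-depth positions were written by “samtools depth”.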

For other Samtools commands, you can check the Samtools documentation, which is
available at “http://www.htslib.org/doc/samtools.html”.

2.5  REFERENCE-GUIDED GENOME ASSEMBLY

Reference-guided genome assembly uses the reference genome of an organism as a guide
to assemble a new genome. This kind of assembly is used when a genome is re-sequenced
to obtain a better-quality assembly or for variant discovery and haplotype construction.
It is widely used to sequence the genomes of individuals of the same species as the
reference genome in order to detect genotypes that may be associated with certain
phenotypes or diseases such as cancers, or to identify viral and bacterial variants or
strains. It can also be used to assemble the genomes of closely related species that do
not have reference genomes available. Sequences of the whole genome of an individual
are used in the assembly. The workflow is the same as that shown in Figure 2.13 up to
the point of creating the SAM/BAM file. The additional step is that the aligned reads are
piled up to create consensus sequences from the overlapping, contiguous aligned reads.
These consensus sequences are called contigs. From these contigs, only the bases that
differ from the reference (variants) are used to edit the sequence of the reference genome
to create a new genome sequence.
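The final editing step can be illustrated with a toy example. The sketch below is not how a real consensus tool works internally. It only shows the idea of substituting variant bases into a reference sequence, and it assumes single-base substitutions only (insertions and deletions would shift the coordinates). The sequence, the variant list, and the variable names are hypothetical.

```shell
# Toy reference sequence and a made-up list of SNPs: 1-based position, new base.
ref_seq="ACGTACGTAC"
variants=$(printf '3\tT\n8\tA\n')

# Substitute each variant base into the reference to obtain the new sequence.
new_seq=$(printf '%s\n' "$variants" | awk -F '\t' -v seq="$ref_seq" '
    NF == 2 { seq = substr(seq, 1, $1 - 1) $2 substr(seq, $1 + 1) }
    END { print seq }')

echo "$new_seq"
```

All positions that are not listed as variants keep the reference base, which is why the new sequence stays in the same coordinate system as the reference in this simplified case.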

For the following practice, you can use any of the SAM files produced by BWA or
Bowtie2 above, or you can run the following commands to download the FASTQ files from
the NCBI SRA database and compress them, download the human reference genome
from the UCSC database and index it, and then perform read mapping with Bowtie2 to
produce a SAM file:

mkdir ref_guided_ass

cd ref_guided_ass

mkdir data

cd data

fasterq-dump --verbose SRR769545

gzip SRR769545_1.fastq

gzip SRR769545_2.fastq

cd ..

mkdir ref